36 research outputs found

    Shallow feature based dense attention network for crowd counting

    Get PDF
    While the performance of crowd counting via deep learning has been improved dramatically in the recent years, it remains an ingrained problem due to cluttered backgrounds and varying scales of people within an image. In this paper, we propose a Shallow feature based Dense Attention Network (SDANet) for crowd counting from still images, which diminishes the impact of backgrounds via involving a shallow feature based attention model, and meanwhile, captures multi-scale information via densely connecting hierarchical image features. Specifically, inspired by the observation that backgrounds and human crowds generally have noticeably different responses in shallow features, we decide to build our attention model upon shallow-feature maps, which results in accurate background-pixel detection. Moreover, considering that the most representative features of people across different scales can appear in different layers of a feature extraction network, to better keep them all, we propose to densely connect hierarchical image features of different layers and subsequently encode them for estimating crowd density. Experimental results on three benchmark datasets clearly demonstrate the superiority of SDANet when dealing with different scenarios. Particularly, on the challenging UCF CC 50 dataset, our method outperforms other existing methods by a large margin, as is evident from a remarkable 11.9% Mean Absolute Error (MAE) drop of our SDANet

    Query-as-context Pre-training for Dense Passage Retrieval

    Full text link
    Recently, methods have been developed to improve the performance of dense passage retrieval by using context-supervised pre-training. These methods simply consider two passages from the same document to be relevant, without taking into account the possibility of weakly correlated pairs. Thus, this paper proposes query-as-context pre-training, a simple yet effective pre-training technique to alleviate the issue. Query-as-context pre-training assumes that the query derived from a passage is more likely to be relevant to that passage and forms a passage-query pair. These passage-query pairs are then used in contrastive or generative context-supervised pre-training. The pre-trained models are evaluated on large-scale passage retrieval benchmarks and out-of-domain zero-shot benchmarks. Experimental results show that query-as-context pre-training brings considerable gains and meanwhile speeds up training, demonstrating its effectiveness and efficiency. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc .Comment: EMNLP 2023 Main Conferenc

    Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

    Full text link
    In this paper, we systematically study the potential of pre-training with Large Language Model(LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e. query generation, and effectively transfer expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inferences. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts the retrieval performance on large-scale web-search tasks. Our work shows strong zero-shot and out-of-domain retrieval abilities, making it more widely applicable for retrieval when initializing with no human-labeled data.Comment: 10 pages, 3 tables, 4 figures, under revie

    RepViT: Revisiting Mobile CNN From ViT Perspective

    Full text link
    Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3ms latency. The code and trained models are available at \url{https://github.com/jameslahm/RepViT}.Comment: 9 pages, 7 figure

    Learning Transformation-Invariant Local Descriptors With Low-Coupling Binary Codes

    Get PDF
    Despite the great success achieved by prevailing binary local descriptors, they are still suffering from two problems: 1) vulnerable to the geometric transformations; 2) lack of an effective treatment to the highly-correlated bits that are generated by directly applying the scheme of image hashing. To tackle both limitations, we propose an unsupervised Transformation-invariant Binary Local Descriptor learning method (TBLD). Specifically, the transformation invariance of binary local descriptors is ensured by projecting the original patches and their transformed counterparts into an identical high-dimensional feature space and an identical low-dimensional descriptor space simultaneously. Meanwhile, it enforces the dissimilar image patches to have distinctive binary local descriptors. Moreover, to reduce high correlations between bits, we propose a bottom-up learning strategy, termed Adversarial Constraint Module , where low-coupling binary codes are introduced externally to guide the learning of binary local descriptors. With the aid of the Wasserstein loss, the framework is optimized to encourage the distribution of the generated binary local descriptors to mimic that of the introduced low-coupling binary codes, eventually making the former more low-coupling. Experimental results on three benchmark datasets well demonstrate the superiority of the proposed method over the state-of-the-art methods

    On Aggregation of Unsupervised Deep Binary Descriptor with Weak Bits

    Get PDF
    Despite the thrilling success achieved by existing binary descriptors, most of them are still in the mire of three limitations: 1) vulnerable to the geometric transformations; 2) incapable of preserving the manifold structure when learning binary codes; 3) NO guarantee to find the true match if multiple candidates happen to have the same Hamming distance to a given query. All these together make the binary descriptor less effective, given large-scale visual recognition tasks. In this paper, we propose a novel learning-based feature descriptor, namely Unsupervised Deep Binary Descriptor (UDBD), which learns transformation invariant binary descriptors via projecting the original data and their transformed sets into a joint binary space. Moreover, we involve a â„“2,1-norm loss term in the binary embedding process to gain simultaneously the robustness against data noises and less probability of mistakenly flipping bits of the binary descriptor, on top of it, a graph constraint is used to preserve the original manifold structure in the binary space. Furthermore, a weak bit mechanism is adopted to find the real match from candidates sharing the same minimum Hamming distance, thus enhancing matching performance. Extensive experimental results on public datasets show the superiority of UDBD in terms of matching and retrieval accuracy over state-of-the-arts

    IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

    Full text link
    Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult to optimally capture such sophisticated correspondences in existing methods. In this paper, to address such a deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured with multiple steps of alignments. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experiment results on three benchmark datasets, i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named \Ads{}, further validates the applicability of our method in practical scenarios.Comment: 9 pages; Accepted by CVPR202
    corecore